Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms
Authors
Abstract
The Berkeley Data Analytics Stack (BDAS) is an emerging framework for big data analytics. It consists of the Spark analytics framework, the Tachyon in-memory filesystem, and the Mesos cluster manager. Spark was designed as an in-memory replacement for Hadoop that can in some cases improve performance by up to 100x. In this paper, we describe our experiences running BDAS on the new Cray Urika-XA extreme analytics platform, on Cray XC systems, and on a prototype Aries-based system with node-local SSDs. We discuss how we configured and optimized the BDAS stack, and describe the execution environment used on each platform. BDAS applications differ significantly from traditional HPC applications: they run in the Java Virtual Machine and communicate via TCP/IP. We explore how Cray system capabilities, such as the Aries interconnect and SSDs, can be better leveraged to improve the performance of these types of applications.
Keywords: Spark; Tachyon; Berkeley Data Analytics Stack; Urika-XA; Cray XC; data analytics; big data
Similar resources
My Cray can do that? Supporting Diverse Workloads on the Cray XE-6
The Cray XE architecture has been optimized to support tightly coupled MPI applications, but there is an increasing need to run more diverse workloads in the scientific and technical computing domains. These needs are being driven by trends such as the increasing need to process “Big Data”. In the scientific arena, this is exemplified by the need to analyze data from instruments ranging from se...
Comparison of Different Computer Platforms for Running the Versatile Advection Code
The Versatile Advection Code is a general tool for solving hydrodynamical and magnetohydrodynamical problems arising in astrophysics. We compare the performance of the code on different computer platforms, including workstations and vector and parallel supercomputers. Good parallel scaling can be achieved with the data parallelism expressed in High Performance Fortran. With the aid of the auto...
A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
Efficiency Evaluation of Cray XT Parallel IO Stack
PetaScale computing platforms need to be coupled with efficient IO subsystems that can deliver commensurate IO throughput to scientific applications. In order to gain insights into the deliverable IO efficiency on the Cray XT platform at ORNL, this paper presents an in-depth efficiency evaluation of its parallel IO software stack. Our evaluation covers the performance of a variety of parallel I...
An Investigation on the User Behavior in Social Commerce Platforms: A Text Analytics Approach
Nowadays, the tourism industry accounts for approximately 10% of the global GDP, while it only contributes 3% of the economy in Iran. Since the pressure of US sanctions increases day after day on the Iranian economy, the necessity of paying attention to this industry as a source of foreign currency is felt more than ever. The purpose of this research is to analyze the reviews of users of social...
Journal title:
Volume, issue:
Pages: -
Publication date: 2015